Darwin improvements: indexing scaling, retention, analytics dashboard, tests#39
Open
rajivml wants to merge 9 commits into feature/darwin
Conversation
…, tests
Indexing scaling (NUM_INDEXING_WORKERS > 1 now safe):
- Per-cc-pair Postgres advisory lock prevents same-cc-pair concurrent runs
- Scheduler-side per-DocumentSource cap (INDEXING_PER_SOURCE_CAP, default 1)
defers over-cap NOT_STARTED attempts in update.py rather than fail-fast
- Per-attempt indexing_priority column lets manual triggers jump the queue
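A minimal sketch of the per-cc-pair advisory-lock idea described above (the helper names and the key-derivation scheme here are illustrative, not the PR's exact code):

```python
import hashlib
from sqlalchemy import text
from sqlalchemy.orm import Session

# Illustrative namespace tag; the PR keeps a distinct namespace per lock family
# (indexing vs. deletion) so their key spaces cannot collide.
_INDEXING_NAMESPACE = b"INDX"

def _advisory_lock_key(namespace: bytes, cc_pair_id: int) -> int:
    """Derive a deterministic signed 64-bit key for pg_try_advisory_lock."""
    digest = hashlib.sha256(namespace + cc_pair_id.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def try_acquire_cc_pair_lock(db_session: Session, cc_pair_id: int) -> bool:
    """Returns True if this session now holds the lock, False on contention."""
    key = _advisory_lock_key(_INDEXING_NAMESPACE, cc_pair_id)
    return db_session.execute(
        text("SELECT pg_try_advisory_lock(:key)"), {"key": key}
    ).scalar()

def release_cc_pair_lock(db_session: Session, cc_pair_id: int) -> None:
    # Roll back first so an aborted transaction cannot block the unlock call
    # (the "rollback-before-unlock" pattern mentioned throughout this PR).
    db_session.rollback()
    key = _advisory_lock_key(_INDEXING_NAMESPACE, cc_pair_id)
    db_session.execute(text("SELECT pg_advisory_unlock(:key)"), {"key": key})
```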
DB retention (new daily Celery beat at 08:00 UTC, danswer/db/retention.py):
- 6 policies: kombu_message, task_queue_jobs, index_attempt (opt-in,
keep-last-N), permission_sync_run, usage_reports, chat
- Batched DELETEs (5000/batch) under a single advisory lock + ANALYZE after
large purges. Rollback-before-unlock pattern prevents the lock from getting
stranded on a failed-transaction state.
- Chat retention is FK-safe: cleans search_doc orphans + file_store rows +
Postgres LO blobs attached to chat_message.files. SELECT FOR UPDATE on
session rows blocks new chat_message inserts during the batch transaction.
- One-shot CLI: scripts/cleanup_stale_db.py --dry-run / --policy=...
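The batched-DELETE loop can be pictured roughly like this; it is a sketch only (table and column names are assumptions, the real module is danswer/db/retention.py), and it assumes the caller already holds the retention advisory lock:

```python
from sqlalchemy import text
from sqlalchemy.orm import Session

BATCH_SIZE = 5000   # RETENTION_BATCH_SIZE default from this PR
MAX_BATCHES = 200   # RETENTION_MAX_BATCHES default from this PR

def purge_old_rows(db_session: Session, table: str, cutoff_days: int) -> int:
    """Delete rows older than the cutoff in fixed-size batches.

    Assumes the table has `id` and `time_created` columns; each batch commits
    on its own so a failure never leaves one giant open transaction.
    """
    deleted_total = 0
    for _ in range(MAX_BATCHES):
        deleted = db_session.execute(
            text(
                f"DELETE FROM {table} WHERE id IN ("
                f" SELECT id FROM {table}"
                f" WHERE time_created < now() - make_interval(days => :days)"
                f" LIMIT :batch)"
            ),
            {"days": cutoff_days, "batch": BATCH_SIZE},
        ).rowcount
        db_session.commit()
        deleted_total += deleted
        if deleted < BATCH_SIZE:
            break
    if deleted_total >= BATCH_SIZE:
        # ANALYZE after a large purge so the planner sees the new row counts.
        db_session.execute(text(f"ANALYZE {table}"))
        db_session.commit()
    return deleted_total
```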
Analytics dashboard (community module, parallel to the EE one):
- New /admin/analytics page using Tremor (already in deps): KPI tiles +
AreaCharts + BarList. Day/Month granularity toggle, date range picker,
strict NPS computation (likes + resolved vs dislikes + needs_help).
- 6 backend endpoints under /api/analytics/admin/{query, user, danswerbot,
total-docs, docs-per-source, slack-channels}.
- analytics_daily_rollup table + checkpoint-based daily Celery beat at
07:30 UTC (30 min before retention sweep) so the dashboard survives chat
retention deletes. Backfill CLI: scripts/backfill_analytics_rollup.py.
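The strict NPS number above (likes + resolved vs dislikes + needs_help) reduces to a small calculation; this is a sketch with argument names of my choosing, not the PR's function:

```python
def strict_nps(likes: int, resolved: int, dislikes: int, needs_help: int) -> float | None:
    """Promoters are likes + resolved feedback; detractors are dislikes + needs_help.

    Returns a -100..100 score, or None when there is no feedback at all so the
    dashboard can render a placeholder instead of NaN.
    """
    promoters = likes + resolved
    detractors = dislikes + needs_help
    total = promoters + detractors
    if total == 0:
        return None
    return 100.0 * (promoters - detractors) / total
```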
Slackbot resolved-button feedback (NEW DB write):
- handle_followup_resolved_button now records chat_feedback with
predefined_feedback='resolved'. Threaded message_id through
build_follow_up_resolved_blocks so the second resolved button has it too.
- Powers the strict-NPS calculation on the analytics dashboard.
Indexing-status admin page UI (improvements):
- Status filter dropdown (success / failed / in_progress / not_started /
paused), search-by-connector-name with icon, bulk pause/re-enable actions
in a dedicated row, pagination at 10/page, "Clear filters" button.
Connector improvements:
- Split monolithic Salesforce admin page into sf-account and sf-kbarticles
with shared ConnectedApp credentials.
- New github-files connector for indexing repo file contents.
- Slack connector accepts channel IDs alongside channel names.
- Per-cc-pair credential edit forms across multiple connector pages.
DB migrations (4 new):
- 9d02a9a5ce39: indexing-status perf indexes (task_queue_jobs, index_attempt)
- fd307e9ecc9b: index_attempt.indexing_priority column + supporting index
- b5d3f1a9e7c2: chat UI perf indexes (chat_message.chat_session_id,
chat_session.user_id) — CREATE INDEX CONCURRENTLY for online deploy
- c8a4e2f9d1b3: analytics_daily_rollup table
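Because CREATE INDEX CONCURRENTLY cannot run inside a transaction, a migration in the b5d3f1a9e7c2 style has to step outside Alembic's transactional DDL. A sketch of that pattern (index names and the down_revision are illustrative; the table/column pairs are the ones listed above):

```python
# Alembic sketch: online index creation for the chat UI queries.
from alembic import op

revision = "b5d3f1a9e7c2"
down_revision = None  # placeholder; the real migration chains off the prior head

def upgrade() -> None:
    # CREATE INDEX CONCURRENTLY must run outside a transaction block.
    with op.get_context().autocommit_block():
        op.create_index(
            "ix_chat_message_chat_session_id",
            "chat_message",
            ["chat_session_id"],
            postgresql_concurrently=True,
        )
        op.create_index(
            "ix_chat_session_user_id",
            "chat_session",
            ["user_id"],
            postgresql_concurrently=True,
        )

def downgrade() -> None:
    with op.get_context().autocommit_block():
        op.drop_index("ix_chat_message_chat_session_id", table_name="chat_message")
        op.drop_index("ix_chat_session_user_id", table_name="chat_session")
```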
Test infrastructure (new):
- scripts/seed_test_data.py — auto-generated tag-isolated fixtures with
configurable knobs (--days, --chats-per-day, feedback distribution,
--with-old-data, --with-search-docs, etc.)
- scripts/test_analytics_e2e.py — analytics + chat retention orchestrator
- scripts/test_features_e2e.py — priority + index_attempt retention +
permission_sync + resolved-feedback DB write
- scripts/test_celery_jobs_smoke.py — broker → worker plumbing check
via .delay() against fresh dummy data
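The broker-to-worker plumbing check boils down to firing the task asynchronously and waiting on its result; a sketch (the timeout value and the helper are illustrative, and a Celery result backend is assumed to be configured):

```python
from celery.result import AsyncResult

def smoke_check(task, timeout_s: float = 30.0) -> None:
    """Queue the task via the broker and block until the worker reports back."""
    result: AsyncResult = task.delay()
    # get() re-raises any exception thrown inside the worker.
    result.get(timeout=timeout_s)
    assert result.state == "SUCCESS", f"unexpected state: {result.state}"

# e.g. smoke_check(run_retention_policies_task) against freshly seeded dummy data
```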
Documentation:
- AGENTS.md + CLAUDE.md (new) — agent operating notes for this fork
- TESTING.md (new) — test scripts + manual UI checklist
- CONTRIBUTING.md — sections on indexing scaling, per-source cap, retention
env vars, analytics rollup, testing pipeline, all stress-test profiles
Kubernetes:
- env-configmap.yaml documents the new INDEXING_PER_SOURCE_CAP,
RETENTION_DAYS_*, and ANALYTICS_LATE_FEEDBACK_BUFFER_DAYS env vars
(all blank = use code defaults).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uff E402 + prettier

CI pre-commit was failing on the prior commit (6cbdb06). Five hooks needed satisfying:
- black: 20 files reformatted (line wrapping, trailing commas).
- reorder-python-imports: 9 files reordered (alphabetical within blocks).
- autoflake: removed unused imports / variables.
- ruff E402 (real bug, not just style): 6 late module-level imports in backend/danswer/db/index_attempt.py were trailing the new advisory-lock helpers (lines 89-94). Hoisted them to the top with the other imports — no actual circular dep, the late placement was incidental.
- prettier: 8 frontend files reformatted (trailing commas, line breaks).

Verified after the fixes:
- pre-commit run --from-ref feature/darwin --to-ref HEAD: all 5 hooks pass.
- mypy clean on the touched backend modules (retention, analytics_rollup, analytics, server/analytics/api, celery_app, update, run_indexing, index_attempt).
- Pre-existing tsc errors in sf-account/sf-kbarticles pages are unchanged (not introduced by formatters; will be fixed in a follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `deployment/kubernetes/analytics-bootstrap-job.yaml` — a Kubernetes
batch/v1 Job that runs ONCE in the `darwin` namespace and:
1. `alembic upgrade heads` to apply the new analytics_daily_rollup
migration (and any other pending revisions). Idempotent.
2. `python scripts/backfill_analytics_rollup.py` to walk every
historical date that still has chat data and populate the rollup
table + checkpoint. Idempotent via INSERT…ON CONFLICT(date).
Apply ONCE, after the new backend image with the PR's code rolls out
to api-server / background-deployment, and BEFORE the next 08:00 UTC
retention sweep on a fresh DB. After this Job completes, the daily
Celery beat task at 07:30 UTC takes over.
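The "idempotent via INSERT…ON CONFLICT(date)" property is what makes re-applying the Job safe; roughly like the following sketch, where the metric column names are assumptions rather than the rollup table's real schema:

```python
from datetime import date
from sqlalchemy import text
from sqlalchemy.orm import Session

def upsert_rollup_row(db_session: Session, day: date, metrics: dict[str, int]) -> None:
    """Write one analytics_daily_rollup row; re-running for the same date overwrites it."""
    db_session.execute(
        text(
            "INSERT INTO analytics_daily_rollup (date, queries, likes, dislikes) "
            "VALUES (:date, :queries, :likes, :dislikes) "
            "ON CONFLICT (date) DO UPDATE SET "
            "  queries = EXCLUDED.queries, "
            "  likes = EXCLUDED.likes, "
            "  dislikes = EXCLUDED.dislikes"
        ),
        {"date": day, **metrics},
    )
    db_session.commit()
```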
Why a Job (not Deployment / Pod):
- Deployment auto-restarts on container exit — wrong for one-time work.
- Bare Pod doesn't track success / failure cleanly.
- Job has run-to-completion + retry-on-failure (backoffLimit=3) +
auto-cleanup (ttlSecondsAfterFinished=3600). Standard K8s pattern
for one-shot maintenance.
Mirrors the api-server's image / env / volumes:
- Same image (placeholder: vha-119 — bump to the post-merge tag).
- POSTGRES_USER + POSTGRES_PASSWORD via danswer-secrets.
- envFrom env-configmap (POSTGRES_HOST, encryption keys, etc.).
- dynamic-pvc + file-connector-pvc volumes (defensive parity).
`activeDeadlineSeconds: 1800` caps the Job at 30 minutes — generous
even for very large chat histories. `restartPolicy: OnFailure` retries
within the same Pod before backoffLimit kicks in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CONTRIBUTING.md was carrying ~120 lines of testing how-to (orchestrator fast paths, stress-test profiles, edge cases, knob reference, etc.) that properly belong in TESTING.md. CONTRIBUTING.md is now setup-focused: guidelines, local setup, env vars, formatting, release process — with a short pointer to TESTING.md for the testing details.

TESTING.md gains four new sections that previously lived in CONTRIBUTING.md:
- Stress-test profiles (Medium / Heavy / Massive single-line variants)
- "What 'stress' actually exercises" caveat
- Edge-case scenarios (all-positive, all-negative, slackbot-only)
- Quick state check after seeding (psql one-liner)
- Knob reference table

CONTRIBUTING.md shrank from 705 → 601 lines; TESTING.md grew from 260 → 334 lines. Net ~30 lines deduped. No content lost — only relocated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Why:
* Per-source cap leaked because `kickoff_indexing_jobs` only counted DB-IN_PROGRESS rows; attempts queued in Dask but not yet IN_PROGRESS slipped past the cap, letting 2+ same-source attempts run at once and causing the higher-priority queued attempt to lose to a lower-priority one that won the worker race.
* Connector deletion API had no in-flight dedup. Multiple "Delete connector" clicks queued parallel `cleanup_connector_credential_pair_task` invocations that each hit `SELECT ... FOR UPDATE NOWAIT` over the same documents, retried 10x30s, and all timed out.

What:
* `update.py`: extracted `_build_running_view` and `_evaluate_dispatch_for_attempt` pure helpers; the view now folds both DB-IN_PROGRESS and existing_jobs (filtered to non-terminal rows) into per-source / per-cc-pair accounting. Added scheduler-side per-cc-pair collision guard alongside the existing per-source cap.
* `run_indexing.py`: lock-contention path reverts attempts to NOT_STARTED instead of writing FAILED rows; rollback-before-unlock pattern in finally to handle aborted-transaction unlock failures.
* `connector_credential_pair.py`: added `try_acquire_deletion_lock` / `release_deletion_lock` (b"DELE" namespace, distinct from b"INDX").
* `celery_app.py`: `cleanup_connector_credential_pair_task` now acquires the deletion advisory lock at entry and releases via rollback-before-unlock in finally. Skips work + returns 0 on contention.
* `administrative.py`: `/admin/deletion-attempt` returns HTTP 409 if a cleanup task for this cc-pair is already live in `task_queue_jobs`.
* `web/.../status/page.tsx`: added Refresh button (with loading state driven by SWR `isValidating`) above the connectors table.

Tests:
* 22 unit tests in `tests/unit/danswer/background/test_indexing_scheduler.py` + `tests/unit/danswer/db/test_deletion_lock_keys.py`. Includes randomized fuzz (200 iterations) and a regression guard that simulates the pre-fix codepath to confirm test sensitivity.
* `scripts/test_scheduler_e2e.py`: 9 phases against live Postgres (priority sort, cap-leak fix, per-cc-pair guard, completed/FAILED in_flight not consuming cap, no double-counting, 20-iteration soak).
* `scripts/test_deletion_lock_e2e.py`: 6 phases against live Postgres (lock primitive cross-session, held-lock blocks task, lock release on success/exception, 6-thread race, API dedup logic).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two real bugs that fail `next build`'s lint stage on a clean checkout (severity: error, not just warnings):

* `chat/lib.tsx::useScrollonStream` was declared `async`, which silently broke the hook contract — React Hooks cannot be called inside an async function. The function never actually awaits anything; the `async` keyword was a copy-paste error. Removing it fixes 6 rules-of-hooks errors (4 useRef + 2 useEffect calls).
* `WelcomeModal.tsx::_WelcomeModal` violated the React component naming rule (must start with an uppercase letter). The leading underscore caused the lint hook-rules visitor to refuse to treat the function as a component, producing 6 rules-of-hooks errors (1 useRouter + 4 useState + 1 useEffect). Renamed to `WelcomeModalContent` (kept distinct from the wrapper's exported `WelcomeModal`) and updated the single import site.

Pre-existing react-hooks/exhaustive-deps and no-img-element warnings are intentionally left untouched in this commit — they don't fail the build, and addressing them is out of scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These were latent in the codebase: yup `Yup.object().shape({...})`
was producing inferred shapes that didn't match the declared TS
interfaces, so `next build`'s type-check refused to compile after
the rules-of-hooks errors were cleared.
Three files, four schemas:
* `lib/types.ts` — `GithubConfig.repo_name: string` → `?: string`.
The Github connector form's helper text already says "leave blank
to index every repo the access token can see under this owner",
the yup schema doesn't `.required()` it, and existing call sites
use `(values.repo_name || "").trim()`. The TS type was the only
thing claiming it was always present.
* `components/admin/connectors/ConnectorTitle.tsx` — the only
consumer that read `repo_name` directly now renders just the
owner when name is blank (instead of "owner/undefined").
* `admin/connectors/sf-account/page.tsx` (3 schemas) — yup schemas
were missing the optional `sf_credential_kind` discriminator and
the optional `requested_objects` config field, causing the
inferred Shape<> to not match `SalesforceCredentialJson` /
`SalesforceConfig`.
* `admin/connectors/sf-kbarticles/page.tsx` (2 schemas) — same
`sf_credential_kind` fix as sf-account.
`tsc --noEmit` clean after these changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues that surfaced when applying the Job to AKS:

1. backend/Dockerfile — only `force_delete_connector_by_id.py` was being copied into `/app/scripts/`, so the bootstrap Job's `python scripts/backfill_analytics_rollup.py` failed with `[Errno 2] No such file or directory`. Add the backfill script to the runtime image. Safe — it's idempotent and admin-invoked.
2. deployment/kubernetes/analytics-bootstrap-job.yaml — `set -euo pipefail` failed with "Illegal option -o pipefail" because the image's `/bin/sh` is dash (Debian slim base), not bash. The script has no pipes anyway, so plain `set -eu` is sufficient and portable across every POSIX sh.

After this commit, build + push a new backend image tag and bump the Job's `image:` to it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /admin/indexing/status page polled all cc-pairs every 10s, which
hurts environments with hundreds of (mostly disabled) connectors.
Three coordinated wins:
1. Backend: `/admin/connector/indexing-status` accepts a new optional
`disabled` query param. `disabled=false` returns enabled connectors
only, `disabled=true` returns disabled, omitted = all. Filter is
applied after the existing bulk-fetch (the ~400-row query is cheap;
the win is in the JSON response size and downstream rendering).
2. Frontend: page now defaults to "Enabled only" via a Show dropdown
(Enabled / Disabled / All), passing the filter as part of the SWR
cache key. Environments with mostly-paused historical connectors
stop shipping them over the wire on every poll.
3. Frontend SWR options:
- `refreshInterval`: 10s -> 30s. 10s was unnecessarily aggressive
for an admin overview page.
- `refreshWhenHidden: false`. Backgrounded admin tabs stop polling.
- `revalidateOnFocus: true`. Stale view re-fetches when the tab
regains focus.
4. Refresh button loading state is now bound to a manual-refresh flag
instead of `isValidating`, so the spinner only spins when the user
clicks Refresh - not on every background poll.
Also bumps the analytics-bootstrap Job image to vha-121 to align with
the upcoming backend rebuild that ships scripts/backfill_analytics_rollup.py
(per the previous commit's Dockerfile change).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Multi-area enhancement PR landing the indexing-scaling, DB-retention, and admin-analytics work along with the supporting test infrastructure and connector / UI improvements built up across the Darwin development cycle.
- Indexing scaling (`NUM_INDEXING_WORKERS > 1` is now safe): per-cc-pair Postgres advisory lock + scheduler-side per-DocumentSource concurrency cap + per-attempt indexing priority.
- Analytics dashboard at `/admin/analytics` (Tremor): KPI tiles + AreaCharts + BarList, strict NPS calculation, Day/Month granularity toggle, daily rollup table populated 30 minutes before the retention sweep so historical analytics survive chat data deletion.
- Slackbot resolved button now writes `chat_feedback` rows (powers the strict-NPS feedback signal).
- Connector improvements: Salesforce split into `sf-account` / `sf-kbarticles`, new GitHub-Files connector, Slack channel-ID support, in-place credential edit forms.
- 4 new DB migrations: perf indexes, the `index_attempt.indexing_priority` column, the `analytics_daily_rollup` table.
- New test scripts under `backend/scripts/`, all using tag-isolated dummy data.
- `AGENTS.md`, `CLAUDE.md`, `TESTING.md` (new) + `CONTRIBUTING.md` updates covering the new env vars, scaling, retention windows, and stress-test profiles.
- New env vars documented in `deployment/kubernetes/env-configmap.yaml` (all blank = defaults).

What changed (by area)
Backend
- `backend/danswer/db/retention.py` (new) — six retention policies under one daily Celery beat task. Advisory lock + batched per-policy DELETEs. Rollback-before-unlock pattern in the `finally` so a failed-transaction state can't strand the lock on a pooled connection.
- `backend/danswer/db/analytics.py` + `backend/danswer/db/analytics_rollup.py` + `backend/danswer/server/analytics/api.py` (new) — community-side analytics module parallel to the EE one. Endpoints under `/api/analytics/admin/...`. Rollup table populated by a checkpoint-driven daily task.
- `backend/danswer/configs/indexing_concurrency.py` (new) + `backend/danswer/background/update.py` — scheduler-side per-DocumentSource cap. Over-cap NOT_STARTED attempts stay scheduled; no spurious FAILED rows.
- `backend/danswer/db/index_attempt.py` + `backend/danswer/background/indexing/run_indexing.py` — per-cc-pair advisory lock prevents same-cc-pair concurrent runs.
- `backend/danswer/danswerbot/slack/handlers/handle_buttons.py` + `backend/danswer/danswerbot/slack/blocks.py` — resolved button now records `chat_feedback` with `predefined_feedback='resolved'`.
- `backend/danswer/background/celery/celery_app.py` — registers `run_analytics_rollup_task` (07:30 UTC) + `run_retention_policies_task` (08:00 UTC) beat entries; a sketch of the schedule follows this list.
- `backend/alembic/versions/9d02a9a5ce39` — task_queue_jobs / index_attempt indexes for the indexing-status page.
- `backend/alembic/versions/fd307e9ecc9b` — adds `index_attempt.indexing_priority` + supporting index.
- `backend/alembic/versions/b5d3f1a9e7c2` — `chat_message(chat_session_id)` and `chat_session(user_id)` btree indexes via `CREATE INDEX CONCURRENTLY`.
- `backend/alembic/versions/c8a4e2f9d1b3` — `analytics_daily_rollup` table.
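The two new beat entries amount to schedule definitions roughly like the following sketch (entry names and task names are taken from this PR; everything else, including how the schedule dict is attached to the Celery app, is illustrative):

```python
# Sketch of the Celery beat schedule described above: rollup at 07:30 UTC,
# retention at 08:00 UTC, so rollups always run before chat rows are purged.
from celery.schedules import crontab

beat_schedule = {
    "run-analytics-rollup": {
        "task": "run_analytics_rollup_task",
        "schedule": crontab(hour=7, minute=30),  # assumes timezone="UTC"
    },
    "run-retention": {
        "task": "run_retention_policies_task",
        "schedule": crontab(hour=8, minute=0),
    },
}
```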
Frontend

- `web/src/app/admin/analytics/page.tsx` (new) — full analytics dashboard with date range picker, KPIs, charts, BarList.
- `web/src/app/admin/indexing/status/CCPairIndexingStatusTable.tsx` — status filter dropdown, name search, bulk action row, pagination=10, "Clear filters" button.
- `web/src/components/admin/Layout.tsx` — sidebar entry for the analytics page.
- `web/src/app/admin/connectors/sf-account/page.tsx` + `sf-kbarticles/page.tsx` — split from monolithic Salesforce page.
- `web/src/app/admin/connectors/github-files/page.tsx` (new) — GitHub-Files setup UI.
- `web/src/app/admin/connector/[ccPairId]/CredentialSection.tsx` (new) — generic in-place credential edit.
Infra / docs

- `deployment/kubernetes/env-configmap.yaml` — adds documented blank entries for `INDEXING_PER_SOURCE_CAP`, `RETENTION_DAYS_*`, `ANALYTICS_LATE_FEEDBACK_BUFFER_DAYS`.
- `AGENTS.md`, `CLAUDE.md`, `TESTING.md` (new); `CONTRIBUTING.md` substantially expanded.
- `.gitignore` — Claude Code `/export` outputs + stray `requestdata.json` debug payloads.

What's tested
Positive scenarios
Run end-to-end via the new orchestrator scripts under `backend/scripts/`:

- `get_not_started_index_attempts` returns rows in `priority DESC, time_created ASC` order; `update_index_attempt_priority` clamps to ceiling=100 and refuses on IN_PROGRESS rows. (`test_features_e2e.py` Phase 1)
- With `RETENTION_DAYS_INDEX_ATTEMPT=60` and `KEEP_LAST_N=20`, 25 SUCCESS rows aged 70-94d collapse to the 20 most recent; the 5 oldest are deleted. Also serves as a regression check on the column's case-sensitive uppercase status storage. (`test_features_e2e.py` Phase 2)
- Index-attempt retention leaves running work alone: 5 terminal (success/failed) rows aged 90d are deleted, 3 `in_progress` rows of the same age survive. (`test_features_e2e.py` Phase 3)
- `create_chat_message_feedback(predefined_feedback='resolved', ...)` writes exactly one row with the right shape against a slackbot-style session. (`test_features_e2e.py` Phase 4)
- The rollup task populates `analytics_daily_rollup` with non-zero sums; the checkpoint advances to today. (`test_analytics_e2e.py` Phase 2-3)
- Further scenarios covered by `test_analytics_e2e.py` Phases 3, 5, 6, and 7.
- `run_analytics_rollup_task.delay()` and `run_retention_policies_task.delay()` complete in <2s with `state=SUCCESS` and apply visible side effects. (`test_celery_jobs_smoke.py`)
- Everything runs against `__test_*__` tagged rows; no risk to real data.

A sketch of the priority ordering and clamping these phases exercise follows this list.
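A minimal sketch (SQLAlchemy 2.x style; it assumes the project's `IndexAttempt` model and `IndexingStatus` enum, and the exact query shape is an assumption):

```python
from sqlalchemy import select
from sqlalchemy.orm import Session

MAX_INDEXING_PRIORITY = 100  # clamp ceiling mentioned above

def get_not_started_index_attempts(db_session: Session) -> list["IndexAttempt"]:
    """Highest priority first; ties broken by oldest creation time."""
    stmt = (
        select(IndexAttempt)
        .where(IndexAttempt.status == IndexingStatus.NOT_STARTED)
        .order_by(
            IndexAttempt.indexing_priority.desc(),
            IndexAttempt.time_created.asc(),
        )
    )
    return list(db_session.scalars(stmt).all())

def update_index_attempt_priority(
    db_session: Session, attempt: "IndexAttempt", priority: int
) -> None:
    if attempt.status != IndexingStatus.NOT_STARTED:
        raise ValueError("priority can only be changed while the attempt is queued")
    attempt.indexing_priority = max(0, min(priority, MAX_INDEXING_PRIORITY))
    db_session.commit()
```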
Negative / edge / failure scenarios

- Retention advisory lock held by another session: the `.delay()`'d retention task succeeds quickly but skips its work; 5 seeded old chats remain in place (visible side-effect). (`test_celery_jobs_smoke.py` resilience loop)
- `pg_terminate_backend(pid)` on the lock-holder restores normal operation; the subsequent retention run deletes everything it should.
- The rollback-before-unlock in `finally` no longer strands the lock after a failed transaction — verified by direct in-process call to `run_retention_policies()` after the fix; no orphan lock left behind.
- No flaky assertions from `now()` drift between the before/after queries (a frozen 28-day boundary in the orchestrator's Phase 5 keeps the assertion stable).
- Multiple queued attempts for one source under `INDEXING_PER_SOURCE_CAP=1` produce one IN_PROGRESS at a time; the rest stay NOT_STARTED and get reconsidered each tick (no FAILED rows produced by capping).
- Index-attempt statuses are stored uppercase (compared via `.value`); a lowercase regression silently no-ops the policy. Caught + verified.
- Empty metrics render `—` instead of NaN/error.

A sketch of finding and terminating a stranded lock-holder follows this list.
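The pg_terminate_backend recovery path can be reproduced by hand; here is a sketch of locating whichever backend holds a given 64-bit advisory lock (plain catalog queries, nothing PR-specific; assumes a non-negative key and a role allowed to signal backends):

```python
from sqlalchemy import text
from sqlalchemy.orm import Session

def terminate_advisory_lock_holder(db_session: Session, key: int) -> int:
    """Find and terminate backends holding the 64-bit advisory lock `key`.

    pg_locks splits a bigint advisory key across classid (high 32 bits) and
    objid (low 32 bits), so the lookup matches both halves.
    """
    high, low = (key >> 32) & 0xFFFFFFFF, key & 0xFFFFFFFF
    pids = db_session.execute(
        text(
            "SELECT pid FROM pg_locks "
            "WHERE locktype = 'advisory' AND granted "
            "AND classid::bigint = :high AND objid::bigint = :low"
        ),
        {"high": high, "low": low},
    ).scalars().all()
    for pid in pids:
        db_session.execute(text("SELECT pg_terminate_backend(:pid)"), {"pid": pid})
    return len(pids)
```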
Manual UI smoke (per TESTING.md)

Verify after `python scripts/test_analytics_e2e.py --yes --keep-data`:
- `/admin/analytics` renders all KPIs + 3 charts with non-zero numbers.
- `/admin/indexing/status`: source filter, status filter, name search, bulk pause/re-enable, pagination.

Deployment notes
- `alembic upgrade head` — applies all four new migrations (chat UI perf indexes use `CREATE INDEX CONCURRENTLY` and won't block writes).
- `python scripts/backfill_analytics_rollup.py` — one-time, populates `analytics_daily_rollup` from existing chat data BEFORE the next retention sweep deletes anything. Skipping this leaves the dashboard at zero for historical ranges until the daily task accumulates enough days.
- Restart `background-deployment` so Celery beat picks up the new schedule entries (`run-analytics-rollup` and `run-retention`).

The new env vars (all defaulted sensibly; a parsing sketch follows the list):
- `INDEXING_PER_SOURCE_CAP` (default 1)
- `RETENTION_DAYS_KOMBU` (7) / `RETENTION_DAYS_TASK_QUEUE` (30) / `RETENTION_DAYS_INDEX_ATTEMPT` (0=disabled) / `RETENTION_KEEP_LAST_N_INDEX_ATTEMPTS` (20) / `RETENTION_DAYS_CHAT` (30) / `RETENTION_DAYS_PERMISSION_SYNC` (30) / `RETENTION_DAYS_USAGE_REPORTS` (90) / `RETENTION_BATCH_SIZE` (5000) / `RETENTION_MAX_BATCHES` (200)
- `ANALYTICS_LATE_FEEDBACK_BUFFER_DAYS` (default 2; MUST be < `RETENTION_DAYS_CHAT`)
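How these defaults are typically wired up — a sketch of env-var parsing with the documented fallbacks (the variable names are the PR's; the helper itself is illustrative):

```python
import os

def _env_int(name: str, default: int) -> int:
    """Blank or unset env vars fall back to the code default (matches the configmap note)."""
    raw = os.environ.get(name, "").strip()
    return int(raw) if raw else default

INDEXING_PER_SOURCE_CAP = _env_int("INDEXING_PER_SOURCE_CAP", 1)
RETENTION_DAYS_CHAT = _env_int("RETENTION_DAYS_CHAT", 30)
RETENTION_BATCH_SIZE = _env_int("RETENTION_BATCH_SIZE", 5000)
ANALYTICS_LATE_FEEDBACK_BUFFER_DAYS = _env_int("ANALYTICS_LATE_FEEDBACK_BUFFER_DAYS", 2)

# Ordering constraint called out above: late-feedback buffer must be shorter
# than chat retention, or rollups could miss feedback that arrives late.
assert ANALYTICS_LATE_FEEDBACK_BUFFER_DAYS < RETENTION_DAYS_CHAT
```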
Known follow-ups (not in this PR)

- `PermissionSyncRun.update_type` and `.status` columns lack `native_enum=False` in the model; SQLAlchemy 2.x bulk-insert paths fail with a `::permissionsyncjobtype` cast error against the varchar columns. Single-row ORM inserts may work; bulk inserts don't. Workaround already in place via raw SQL where needed (seed factory). Worth a separate model-fix PR — no migration needed (column type is already `varchar`).

User-facing admin capabilities (call-out)
A few of the new behaviours that affect day-to-day operator workflows, called out separately from the deeper technical changes above:
Bump priority of a queued indexing attempt — without code changes or restarts. A manual `Re-Index` from the cc-pair page can now be given a higher priority so it jumps ahead of auto-scheduled work in the Dask queue. Implementation: `PATCH /admin/index-attempt/{index_attempt_id}/priority` (`backend/danswer/server/documents/connector.py:646`) calls `update_index_attempt_priority` (clamps to 0–100; refuses on rows that have already moved out of `NOT_STARTED`). `web/src/app/admin/connector/[ccPairId]/IndexingAttemptsTable.tsx` shows a priority column + bump control; `ReIndexButton.tsx` lets you set the priority at trigger time. Backed by the `index_attempt.indexing_priority` integer column + `(status, indexing_priority, time_created)` index (Alembic `fd307e9ecc9b`). A sketch of the endpoint shape appears after this item.
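A sketch of what such an endpoint can look like (FastAPI style, matching the route shape above; the handler body, `get_session` dependency, and model/enum names are assumed project symbols rather than the PR's exact code):

```python
from fastapi import APIRouter, Depends, HTTPException
from pydantic import BaseModel, Field
from sqlalchemy.orm import Session

router = APIRouter(prefix="/admin")

class PriorityUpdateRequest(BaseModel):
    priority: int = Field(ge=0, le=100)  # clamp range from the description above

@router.patch("/index-attempt/{index_attempt_id}/priority")
def set_index_attempt_priority(
    index_attempt_id: int,
    request: PriorityUpdateRequest,
    db_session: Session = Depends(get_session),  # project dependency, assumed
) -> None:
    attempt = db_session.get(IndexAttempt, index_attempt_id)
    if attempt is None:
        raise HTTPException(status_code=404, detail="Index attempt not found")
    if attempt.status != IndexingStatus.NOT_STARTED:
        # Refuse once the attempt has started running (mirrors the behaviour above).
        raise HTTPException(status_code=400, detail="Attempt is no longer queued")
    attempt.indexing_priority = request.priority
    db_session.commit()
```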
Edit credentials in place — no need to delete and recreate the connector. The new `web/src/app/admin/connector/[ccPairId]/CredentialSection.tsx` component renders a per-cc-pair credential edit form with the field-set inferred from the connector source. Integrated across all touched connector pages: `sf-account`, `sf-kbarticles`, `github`, `github-files`, `confluence`, `jira`, `slack`, `sharepoint`. Backed by the existing `PUT /admin/credential/...` API, exposed through `web/src/lib/credential.ts`.

Indexing-status page filters and bulk actions (already mentioned above, summarised here for completeness): source-type dropdown + status dropdown + name search + bulk pause/re-enable buttons + 10/page pagination + "Clear filters" button. All filter state resets pagination so bulk actions never operate on hidden rows.
Analytics dashboard at `/admin/analytics` (already covered) — first community-edition analytics page in this fork; all six existing-vs-new endpoints feed it.

Concurrency hardening follow-up (commit
e31f2104)Three independent concurrency bugs surfaced while validating the indexing-scaling work; all fixed in a single follow-up commit on the same branch.
1. Per-source cap leak (the "two slack runs at once + lower-priority wins" bug)
Symptom: with `INDEXING_PER_SOURCE_CAP=1` and 7 Slack cc-pairs (one priority-bumped to 20), users observed two Slack runs going simultaneously, and the priority-bumped attempt sometimes lost the worker race to a priority-0 attempt.
Root cause: `kickoff_indexing_jobs` built `running_per_source` purely from DB rows with `status=IN_PROGRESS`. Attempts the scheduler had already submitted to Dask but which were still NOT_STARTED in the DB (the queue / worker-spinning-up window — typically a few seconds) were invisible to the cap. A subsequent tick saw 0 slack IN_PROGRESS, fell through to the next slack candidate, and submitted it. Both were now in Dask's queue; whichever the worker pulled first won, regardless of priority.
Fix (`backend/danswer/background/update.py`):
- Extracted `_build_running_view` and `_evaluate_dispatch_for_attempt` pure helpers (also makes the logic unit-testable without mocking SQLAlchemy).
- The running view now folds `existing_jobs` (filtered to non-terminal `IndexAttempt` rows via `status.notin_([SUCCESS, FAILED])`) into `running_per_source` AND `in_progress_cc_pair_keys`, alongside the DB IN_PROGRESS query.
- `accounted_attempt_ids` dedups so an attempt that's in both lists isn't counted twice.
- The lock-contention path (`backend/danswer/background/indexing/run_indexing.py`) now reverts to NOT_STARTED instead of writing FAILED rows; rollback-before-unlock pattern in `finally` to handle aborted-transaction unlock failures.

A sketch of the folded running-view accounting follows this list.
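A condensed sketch of that accounting (the dataclass and function names are modelled on the description above, not copied from the PR, and the attempt attributes are assumptions):

```python
# Fold DB IN_PROGRESS rows and already-submitted Dask jobs into one view so the
# per-source cap and per-cc-pair guard see work that is queued but not yet
# marked IN_PROGRESS in the database.
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class RunningView:
    running_per_source: Counter = field(default_factory=Counter)
    in_progress_cc_pair_keys: set[int] = field(default_factory=set)
    accounted_attempt_ids: set[int] = field(default_factory=set)

    def account(self, attempt_id: int, source: str, cc_pair_id: int) -> None:
        if attempt_id in self.accounted_attempt_ids:
            return  # same attempt visible in both lists: count it once
        self.accounted_attempt_ids.add(attempt_id)
        self.running_per_source[source] += 1
        self.in_progress_cc_pair_keys.add(cc_pair_id)

def build_running_view(db_in_progress, existing_jobs) -> RunningView:
    """db_in_progress: attempts already IN_PROGRESS; existing_jobs: non-terminal
    attempts the scheduler has submitted to Dask (possibly still NOT_STARTED)."""
    view = RunningView()
    for attempt in list(db_in_progress) + list(existing_jobs):
        view.account(attempt.id, attempt.source, attempt.cc_pair_id)
    return view

def may_dispatch(view: RunningView, source: str, cc_pair_id: int, cap: int) -> bool:
    # Per-source cap plus the per-cc-pair collision guard.
    return (
        view.running_per_source[source] < cap
        and cc_pair_id not in view.in_progress_cc_pair_keys
    )
```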
2. Connector-deletion lock storm (API + worker)

Symptom: 6 `cleanup_connector_credential_pair_task` invocations all failing within a 35 ms window with `Failed to acquire locks after 10 attempts for documents: [...140 doc IDs...]`. Each had spent the full `_NUM_LOCK_ATTEMPTS × _LOCK_RETRY_DELAY` = 5 minutes retrying `SELECT ... FOR UPDATE NOWAIT` over the same docs.
Root cause: `/admin/deletion-attempt` (`backend/danswer/server/manage/administrative.py`) had no in-flight dedup. Six clicks of "Delete connector" (or any retry loop) all called `apply_async` and queued parallel cleanup tasks. They raced over the same documents, retried, all timed out together.

Fix (3 layers, defense in depth):
- `/admin/deletion-attempt` now returns HTTP 409 if a cleanup task for the same cc-pair is already live in `task_queue_jobs` (via `get_latest_task` + `check_task_is_live_and_not_timed_out`). Most user-friendly path.
- `cleanup_connector_credential_pair_task` acquires a per-cc-pair advisory lock at entry (new helpers `try_acquire_deletion_lock` / `release_deletion_lock` in `backend/danswer/db/connector_credential_pair.py`, namespace `b"DELE"` — distinct from the indexing `b"INDX"` namespace). On contention: log + return 0. Released via rollback-before-unlock in `finally`. This is the safety net for any caller that bypasses the API endpoint.
- The existing per-document row locks (`prepare_to_modify_documents`) are unchanged — they still serve their original purpose of preventing concurrent indexer / deletion writes to the same documents.

A sketch of the namespace-separated key derivation follows this list.
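The two lock families only have to agree on never colliding. A sketch of namespace-separated key derivation (the hash choice and helper name are illustrative; the PR itself only specifies the b"DELE" vs b"INDX" namespaces):

```python
import hashlib

INDEXING_NAMESPACE = b"INDX"
DELETION_NAMESPACE = b"DELE"

def advisory_key(namespace: bytes, cc_pair_id: int) -> int:
    """Deterministic signed 64-bit key; different namespaces cannot collide for
    the same cc-pair because the namespace is part of the keyed-hash input."""
    digest = hashlib.blake2b(
        cc_pair_id.to_bytes(8, "big"), digest_size=8, key=namespace
    ).digest()
    return int.from_bytes(digest, "big", signed=True)

# Same cc-pair, different lock families -> different Postgres advisory keys.
assert advisory_key(INDEXING_NAMESPACE, 42) != advisory_key(DELETION_NAMESPACE, 42)
```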
web/src/app/admin/indexing/status/page.tsx— added a Refresh button (icon:FiRefreshCw) above the connectors table that calls SWRmutate()to force an immediate refetch. Loading state driven byisValidating. The existing 10 s background poll is unchanged; this is a manual nudge between ticks for users who don't want to wait.What's tested (follow-up commit)
Unit tests —
backend/tests/unit/danswer/background/test_indexing_scheduler.py— 16 tests:_SimStatesimulator that models Dask flip / finish / crash probabilities, asserting per-source cap and cc-pair invariants every tick.test_buggy_view_without_fix_leaks_cap_regression_guard— explicitly simulates the pre-fix codepath to prove the harness is sensitive enough to detect a regression.danswer/db/test_deletion_lock_keys.py— 6 tests for the deletion advisory-lock key derivation: namespace distinct from indexing lock, deterministic, no small-space collisions, within Postgres bigint range.22 unit tests total; ~5 s runtime.
E2E tests —
backend/scripts/test_scheduler_e2e.py— 9 phases against live Postgres with a recording fake Dask client:existing_jobsentries don't consume cap slots — exercises thestatus.notin_(...)SQL filter against live ORM.existing_jobsaren't double-counted.test_deletion_lock_e2e.py— 6 phases against live Postgres:task_queue_jobsSTARTED+recent → 409, SUCCESS → allow, ancient STARTED → allow (timed out).Both e2e suites: pass against live Postgres in ~5 s combined. Each script auto-cleans tagged data before and after; refuses to run if the production indexer is up (would race on seeded NOT_STARTED rows).
Pre-commit verification
All hooks pass on the commit (black, reorder-python-imports, autoflake, ruff, prettier).
🤖 Generated with Claude Code